Combination of Arabic Preprocessing Schemes for Statistical Machine Translation
Abstract
Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes resulting in improved translation quality.
Similar resources
Arabic Preprocessing Schemes for Statistical Machine Translation
In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Mor...
Arabic-Segmentation Combination Strategies for Statistical Machine Translation
Arabic segmentation has already been applied successfully to the task of statistical machine translation (SMT). Yet, there is no consistent comparison of the effect of different techniques and methods on the final translation quality. In this work, we use existing tools and further re-implement and develop new methods for segmentation. We compare the resulting SMT systems based on the different s...
The University of Washington machine translation system for IWSLT 2009
This paper describes the University of Washington’s system for the 2009 International Workshop on Spoken Language Translation (IWSLT) evaluation campaign. Two systems were developed, one each for the BTEC Chinese-to-English and Arabic-to-English tracks. We describe experiments with different preprocessing and alignment combination schemes. Our main focus this year was on exploring a novel semis...
The RWTH machine translation system for IWSLT 2008
RWTH’s system for the 2008 IWSLT evaluation consists of a combination of different phrase-based and hierarchical statistical machine translation systems. We participated in the translation tasks for the Chinese-to-English and Arabic-to-English language pairs. We investigated different preprocessing techniques, reordering methods for the phrase-based system, including reordering of speech lattice...
The TALP n-gram-based SMT system for IWSLT 2007
This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters as a function of translation metric scores and rescoring with a neur...